This is a weekly report for the capstone project of the JHU Capstone Course by Rongbin Ye, as part of the Data Science Specialization on Coursera. This report provides summary statistics about the data sets and reports some interesting findings via term frequency, n-grams, and word clouds. Furthermore, based on the exploratory data analysis, this report proposes a plan for the capstone project, including a prediction algorithm for typing recommendation and a Shiny app on AWS.
As discussed at the beginning, this report provides a milestone plan for the whole capstone project. The scheduled time is seven weeks. Given this time constraint, the milestones have been set in an accelerated manner, which entails a tight schedule for data cleaning and model development. There are two major deliverables: 1. An algorithm that recommends correlated English words, based on the given data set. 2. A Shiny app for users to interact with.
The milestones have been set in the following four sections: > Week 1 & 2: Data Familiarization + Exploratory Data Cleaning > Week 3 & 4: Text Mining: extract and identify key patterns in the usage habits of English writers > Week 5 & 6: Model Development: an algorithm to be developed and tuned in this process > Week 7: Develop and Deploy a Shiny App
library(tidyverse)
library(tm)
library(lexicon)
library(stringr)
library(stopwords)
library(tidytext)
library(textstem)
library(tidyr)
In this section, using file connections, the txt files are read into R line by line. The texts are stored as character vectors. Based on these three chunks of text, the author conducts preliminary data cleaning and text preprocessing for exploratory data analysis.
# read in multiple lines into one data frame: blogs
blogs_con <- file("~/Downloads/final/en_US/en_US.blogs.txt")
blogs <- readLines(con = blogs_con)
close(blogs_con)
# all the blogs have been loaded properly
# read in multiple lines into one data frame: twitters
twitters_con <- file("~/Downloads/final/en_US/en_US.twitter.txt")
twitter <- readLines(con = twitters_con)
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
close(twitters_con)
# all the tweets have been read in properly
# read in multiple lines into one data frame: news
news_con <- file("~/Downloads/final/en_US/en_US.news.txt")
news <- readLines(con = news_con)
close(news_con)
# all the news lines have been read in properly
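The embedded-nul warnings on the Twitter file can be avoided at read time by passing `skipNul = TRUE` to `readLines`, which silently drops nul bytes. A minimal sketch, assuming the same file path as above:

```r
# Read the Twitter file while dropping embedded nul bytes instead of warning
twitters_con <- file("~/Downloads/final/en_US/en_US.twitter.txt")
twitter <- readLines(con = twitters_con, skipNul = TRUE)
close(twitters_con)
```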
After loading in the data, let us look at the basic shape of three data sets.
len_news <- length(news)
len_blogs <- length(blogs)
len_twitters <- length(twitter)
The news data contains 1010242 lines, the blogs 899288 lines, and the Twitter data 2360148 lines. Let us have a closer look at the words.
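Beyond line counts, rough word counts give a fuller picture of the corpus sizes. A sketch using the vectors loaded above (illustrative only, not re-run here):

```r
# Approximate word counts by splitting each line on runs of whitespace
count_words <- function(lines) sum(lengths(strsplit(lines, "\\s+")))

summary_df <- data.frame(
  source = c("news", "blogs", "twitter"),
  lines  = c(len_news, len_blogs, len_twitters),
  words  = c(count_words(news), count_words(blogs), count_words(twitter))
)
summary_df
```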
In order to conduct the analysis, the three texts will be cleaned. The major cleaning steps include lowercasing, stripping spaces on both sides, removing punctuation, lemmatization, and tokenization. Eventually, the text chunks are broken down into tokens for further analysis and comparison at the level of sentences and words.
blogs <- blogs %>% removeNumbers() %>% removePunctuation() %>% lemmatize_strings()
news <- news %>% removeNumbers() %>% removePunctuation() %>% lemmatize_strings()
twitter <- twitter %>% removeNumbers() %>% removePunctuation() %>% lemmatize_strings()
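The lowercasing and whitespace-stripping steps described above are not yet in the pipeline; tm and base R provide both, so the full cleaning chain could be wrapped as follows (a sketch under the same package setup, not re-run here):

```r
# Full cleaning pipeline as described in the text: lowercase, drop numbers
# and punctuation, collapse and trim whitespace, then lemmatize (textstem)
clean_text <- function(x) {
  x %>%
    tolower() %>%
    removeNumbers() %>%
    removePunctuation() %>%
    stripWhitespace() %>%
    trimws() %>%
    lemmatize_strings()
}
blogs <- clean_text(blogs)
```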
blogs_df <- tibble(line = 1:length(blogs), text = blogs)
tokens_blogs <- blogs_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
## Joining, by = "word"
tokens_blogs
## # A tibble: 349,020 x 2
## word n
## <chr> <int>
## 1 time 105751
## 2 day 70910
## 3 people 61543
## 4 love 58575
## 5 life 48685
## 6 feel 48000
## 7 book 41394
## 8 start 38799
## 9 im 37863
## 10 week 37765
## # … with 349,010 more rows
Surprisingly, the five words bloggers wrote about most are: time, day, people, love, and life. Such a philosophical and metaphysical finding reveals the down-to-earth character of the blogs and hints at their dominant topics.
news_df <- tibble(line = 1:length(news), text = news)
tokens_news <- news_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
## Joining, by = "word"
tokens_news
## # A tibble: 284,001 x 2
## word n
## <chr> <int>
## 1 time 65807
## 2 people 48738
## 3 game 48072
## 4 school 46734
## 5 day 43829
## 6 play 43596
## 7 city 41282
## 8 include 39438
## 9 team 38495
## 10 call 34729
## # … with 283,991 more rows
Meanwhile, the topics in the news are more about time, people, game, school, and day. One preliminary thought is that much of the coverage is sports news: although no specific sport dominates, words such as season, game, and team support this reading.
tokens_news$org <-"news"
tf_idf_news <- tokens_news %>% tidytext::bind_tf_idf(word,org, n)
tf_idf_news %>% arrange(desc(tf_idf))
## # A tibble: 284,001 x 6
## word n org tf idf tf_idf
## <chr> <int> <chr> <dbl> <dbl> <dbl>
## 1 time 65807 news 0.00431 0 0
## 2 people 48738 news 0.00319 0 0
## 3 game 48072 news 0.00314 0 0
## 4 school 46734 news 0.00306 0 0
## 5 day 43829 news 0.00287 0 0
## 6 play 43596 news 0.00285 0 0
## 7 city 41282 news 0.00270 0 0
## 8 include 39438 news 0.00258 0 0
## 9 team 38495 news 0.00252 0 0
## 10 call 34729 news 0.00227 0 0
## # … with 283,991 more rows
Considering the frequencies further, one can see that the adjusted term frequency (tf-idf) does not help identify keywords here: all these words are common across the single corpus. Hence, term frequency alone is already enough to understand this corpus.
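The zero idf values above follow directly from the definition: with only one document (the news corpus alone), every term appears in 1 of 1 documents, so idf = log(1/1) = 0 and tf-idf vanishes. A base-R sketch of the computation:

```r
# idf = log(number of documents / number of documents containing the term)
idf <- function(n_docs, n_docs_with_term) log(n_docs / n_docs_with_term)

idf(1, 1)  # a single corpus: every term scores 0
idf(3, 1)  # a term unique to one of three corpora scores log(3)
```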
# To summarize the existing information, let us develop a word cloud accordingly.
wordcloud::wordcloud(words = tf_idf_news$word, freq = tf_idf_news$tf, max.words = 10, colors = TRUE)
### Twitter
twitters_df <- tibble(line = 1:length(twitter), text = twitter)
tokens_twitters <- twitters_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
## Joining, by = "word"
tokens_twitters
## # A tibble: 455,532 x 2
## word n
## <chr> <int>
## 1 im 157965
## 2 love 120102
## 3 day 108818
## 4 rt 88743
## 5 time 84990
## 6 lol 66791
## 7 follow 66295
## 8 people 52817
## 9 happy 49539
## 10 tonight 43912
## # … with 455,522 more rows
tokens_twitters$org <- "twitters"
tokens_blogs$org <-"blogs"
tf_idf_twitter <- tokens_twitters %>% tidytext::bind_tf_idf(word,org, n)
tf_idf_blogs <- tokens_blogs %>% tidytext::bind_tf_idf(word,org, n)
Yet, one of the issues of pure term frequency is that common words dominate. In order to adjust for the influence of highly frequent words, I adopt the inverse document frequency and expand the counts into a tf-idf model. Using the bind_tf_idf function, the tf-idf metrics are computed.
tokens_all <- rbind(tokens_news, tokens_twitters)
tokens_all <- rbind(tokens_all, tokens_blogs)
tf_idf_all <- tokens_all %>% tidytext::bind_tf_idf(word,org, n)
news_20 <- top_n(tf_idf_news, 20, wt = tf) %>% select(word)
twitter_20 <- top_n(tf_idf_twitter, 20, wt = tf) %>% select(word)
blogs_20 <- top_n(tf_idf_blogs, 20, wt = tf) %>% select(word)
all_20 <- cbind(news_20, twitter_20)
all_20 <- cbind(all_20, blogs_20)
colnames(all_20) <- c("Top News", "Top Tweets", "Top Blogs")
all_20
## Top News Top Tweets Top Blogs
## 1 time im time
## 2 people love day
## 3 game day people
## 4 school rt love
## 5 day time life
## 6 play lol feel
## 7 city follow book
## 8 include people start
## 9 team happy im
## 10 call tonight week
## 11 percent night write
## 12 home feel leave
## 13 run watch read
## 14 million hope world
## 15 county youre call
## 16 start game don
## 17 week life home
## 18 season tweet friend
## 19 win start lot
## 20 company week post
After exploring the frequencies of single words, let us look at the connections among words. In this section, the exploration focuses on two groups of connections: two-word and three-word combinations, separately.
## Bigram Analysis
### Blogs
bigram_blogs <- blogs_df %>% unnest_tokens(bigram, text, token = "ngrams", n=2)
bigram_blogs_final <- bigram_blogs %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
count(word1, word2, sort = TRUE)
T10_blogs <- top_n(bigram_blogs_final, 10, wt = n)
T10_blogs
## # A tibble: 10 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 spin dry 3767
## 2 week ago 1860
## 3 ice cream 1472
## 4 blog post 1402
## 5 social medium 1332
## 6 jesus christ 1318
## 7 month ago 1234
## 8 south africa 1206
## 9 spend time 1075
## 10 olive oil 1041
bigram_news <- news_df %>% unnest_tokens(bigram, text, token = "ngrams", n=2)
bigram_news_final <- bigram_news %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
count(word1, word2, sort = TRUE)
T10_news <- top_n(bigram_news_final, 10, wt = n)
T10_news
## # A tibble: 10 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 st louis 8947
## 2 los angeles 5189
## 3 san francisco 4356
## 4 health care 3586
## 5 school district 3073
## 6 vice president 2884
## 7 san diego 2638
## 8 police officer 2292
## 9 white house 2276
## 10 executive director 2159
Well, the bigrams uncover the geographical coverage of the news presented in the data. St. Louis, Los Angeles, San Francisco, San Diego, and probably DC (white house) are the major regions these news reports cover.
bigram_twitters <- twitters_df %>% unnest_tokens(bigram, text, token = "ngrams", n=2)
bigram_twitters_final <- bigram_twitters %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
count(word1, word2, sort = TRUE)
T10_twitters <- top_n(bigram_twitters_final, 10, wt = n)
T10_twitters
## # A tibble: 10 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 happy birthday 8355
## 2 mother day 5429
## 3 im gonna 4145
## 4 social medium 3773
## 5 happy mother 3371
## 6 stay tune 2532
## 7 san diego 2198
## 8 im glad 2057
## 9 rt rt 2051
## 10 happy friday 1941
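The three-word connections mentioned earlier follow the same pattern as the bigrams, just with n = 3. A sketch for the blogs corpus (same packages as above, not re-run here):

```r
# Trigrams: same unnest/separate/filter/count pipeline as the bigrams
trigram_blogs <- blogs_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)

trigram_blogs_final <- trigram_blogs %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word) %>%
  count(word1, word2, word3, sort = TRUE)

top_n(trigram_blogs_final, 10, wt = n)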
Through this exploration, one can discover some key patterns in these texts. The major keywords are summarized here in the form of word clouds.
## Word Cloud - All
All_join_summary <- inner_join(tf_idf_news, tf_idf_blogs, by = "word")
All_join_summary <- inner_join(All_join_summary, tf_idf_twitter, by = "word")
All_join_summary$tf_idf_all <- All_join_summary$tf_idf.x + All_join_summary$tf_idf.y
All_join_summary$tf_all <- All_join_summary$tf.x + All_join_summary$tf.y + All_join_summary$tf
All_join_summary <- All_join_summary %>% arrange(desc(tf_all))
All_join_summary
## # A tibble: 72,225 x 18
## word n.x org.x tf.x idf.x tf_idf.x n.y org.y tf.y idf.y tf_idf.y
## <chr> <int> <chr> <dbl> <dbl> <dbl> <int> <chr> <dbl> <dbl> <dbl>
## 1 time 65807 news 4.31e-3 0 0 105751 blogs 0.00764 0 0
## 2 day 43829 news 2.87e-3 0 0 70910 blogs 0.00512 0 0
## 3 im 17412 news 1.14e-3 0 0 37863 blogs 0.00273 0 0
## 4 love 13609 news 8.90e-4 0 0 58575 blogs 0.00423 0 0
## 5 peop… 48738 news 3.19e-3 0 0 61543 blogs 0.00444 0 0
## 6 feel 19514 news 1.28e-3 0 0 48000 blogs 0.00347 0 0
## 7 life 20825 news 1.36e-3 0 0 48685 blogs 0.00352 0 0
## 8 start 31863 news 2.08e-3 0 0 38799 blogs 0.00280 0 0
## 9 week 31465 news 2.06e-3 0 0 37765 blogs 0.00273 0 0
## 10 foll… 13113 news 8.58e-4 0 0 17120 blogs 0.00124 0 0
## # … with 72,215 more rows, and 7 more variables: n <int>, org <chr>, tf <dbl>,
## # idf <dbl>, tf_idf <dbl>, tf_idf_all <dbl>, tf_all <dbl>
# To summarize the combined information, let us develop a word cloud accordingly.
wordcloud::wordcloud(words = All_join_summary$word, freq = All_join_summary$tf_all, max.words = 10, colors = TRUE)
## Word Cloud - Blogs
# To summarize the existing information, let us develop a word cloud accordingly.
wordcloud2::wordcloud2(data = tf_idf_blogs)
## Word Cloud - Twitter
# To summarize the existing information, let us develop a word cloud accordingly.
wordcloud::wordcloud(words = tf_idf_twitter$word, freq = tf_idf_twitter$tf, max.words = 10, colors = TRUE)
After the bigram analysis, many meaningful connections among words have been discovered, and these could serve as the foundation for developing a recommendation system for users, should there be demand for one based on the existing corpus.
The exploration of word distributions and word co-occurrence enables further study of the relationships between words within sentences. The bigram and trigram analyses shed light on the design of the recommendation system.
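As a sketch of how the bigram counts could drive the planned next-word recommendation, here is a maximum-likelihood bigram lookup in base R (the helper name `predict_next` and the toy counts are hypothetical, not part of the analysis above):

```r
# Given a data frame of bigram counts (word1, word2, n), return the k most
# frequent continuations of a typed word -- a maximum-likelihood bigram model.
predict_next <- function(typed, bigram_counts, k = 3) {
  candidates <- bigram_counts[bigram_counts$word1 == typed, ]
  candidates <- candidates[order(-candidates$n), ]
  head(candidates$word2, k)
}

# Toy example with hard-coded counts
toy <- data.frame(
  word1 = c("happy", "happy", "happy", "ice"),
  word2 = c("birthday", "friday", "mother", "cream"),
  n     = c(8355, 1941, 3371, 1472),
  stringsAsFactors = FALSE
)
predict_next("happy", toy)  # "birthday" "mother" "friday"
```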